Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Post Import Service #4370

Open
wants to merge 2 commits into
base: 2.x
Choose a base branch
from
Open

Refactor Post Import Service #4370

wants to merge 2 commits into from

Conversation

kaise-lafrai
Copy link
Contributor

@kaise-lafrai kaise-lafrai commented Dec 23, 2024

QA Steps

Run the following tests.

  • ddev dkan-phpunit --filter PostImportTest
  • ddev dkan-phpunit --filter dictionaryenforcertest

Successful Post Import

  • Set up vanilla site
  • ddev drush en sample_content
  • ddev drush dkan-sample-content:create
  • Lets work with the 'Florida Bike Lanes' dataset otherwise known as node::8
  • Create a data dictionary at/node/add/data?schema=data-dictionary
  • Title OBJECTID
  • Select Add field
  • Name OBJECTID
  • Title OBJECTID
  • Data type Integer
  • Description OBJECTID
  • Grab the Identifier for later and select Save
  • Navigate to DD settings : /admin/dkan/data-dictionary/settings
  • Select Sitewide
  • Add the identifier from previous step into Sitewide Dictionary ID
  • Save configuration
  • Run Cron a few times
  • Confirm DD applied with no issues.

Unsuccessful Post Import (Drop Resource)

  • Navigate here and select Last Update and Save /admin/dkan/datastore
  • Update the florida bike lane node last update date time and save
  • Navigate here /admin/dkan/resources and select Drop the datastore table and save.
  • Navigate to the data dictionary created before
  • Alter the field "Data type" to something like "Date" and save
  • Run cron a few times
  • Navigate to dashboard /admin/dkan/datastore/status
  • Confirm that the dataset cedcd327-4e5d-43f9-8eb1-c11850fa7c55 has a Store value of waiting(0%) with an error in the Post Import column.
  • Confirm that the dataset was dropped by trying to access via the api and getting status of 400 /api/1/datastore/query/cedcd327-4e5d-43f9-8eb1-c11850fa7c55/0

Data Dictionary Disabled

  • Navigate to data dictionary settings here admin/dkan/data-dictionary/settings and disable dictionary mode
  • Update the florida bike lane node last update date time and save
  • Run cron a few times
  • Navigate to dashboard /admin/dkan/datastore/status
  • Confirm that the dataset cedcd327-4e5d-43f9-8eb1-c11850fa7c55 has a Post Import status of N/A Data-Dictionary Disabled

Data Dictionary via Distribution Reference

  • Visit admin/dkan/data-dictionary/settings and set the data dictionary mode to "distribution reference."
  • Post the following data dictionary to your DKAN instance:
{
    "title": "Bike lanes data dictionary",
    "data": {
        "fields": [
            {
                "name": "objectid",
                "title": "OBJECTID",
                "type": "integer",
                "description": "Internal feature number."
            },
            {
                "name": "roadway",
                "title": "ROADWAY",
                "type": "string",
                "description": "A unique 8-character identification number assigned to a roadway or section of a roadway either On or Off the State Highway System for which information is maintained in the Department's Roadway Characteristics Inventory (RCI)."
            },
            {
                "name": "road_side",
                "title": "ROAD_SIDE",
                "type": "string",
                "constraints": {
                    "maxLength": 1,
                    "minLength": 1,
                    "enum": ["R", "L", "C"]
                },
                "description": "Side of the road. C = Composite; L = Left side; R = Right side"
            },
            {
                "name": "lncd",
                "title": "LNCD",
                "type": "integer",
                "constraints": {
                    "maxLength": 1,
                    "minLength": 1,
                    "maximum": 5,
                    "minimum": 0
                },
                "description": "Codes 0 = UNDESIGNATED; 1 = DESIGNATED; 2 = BUFFERED; 3 = COLORED; 4 = BOTH 2 AND 3; 5 = SHARROW"
            },
            {
                "name": "descr",
                "title": "DESCR",
                "type": "string",
                "constraints": {
                  "maxLength": 30,
                  "enum": ["UNDESIGNATED", "DESIGNATED"]
                },
                "description": "Designation description."
            },
            {
                "name": "begin_post",
                "title": "BEGIN_POST",
                "type": "number",
                "description": "Denotes the lowest milepoint for the record."
            },
            {
                "name": "end_post",
                "title": "END_POST",
                "type": "number",
                "description": "Denotes the highest milepoint for the record."
            },
            {
                "name": "shape_len",
                "title": "Shape_Leng",
                "type": "number",
                "description": "Length in meters"
            }
        ]
    }
}
  • Post the following dataset to your DKAN instance.
{
    "@type": "dcat:Dataset",
    "accessLevel": "public",
    "contactPoint": {
        "fn": "Jane Doe",
        "hasEmail": "mailto:[email protected]"
    },
    "description": "Data on bike lanes in Florida.",
    "distribution": [
    {
        "@type": "dcat:Distribution",
        "downloadURL": "https://demo.getdkan.org/sites/default/files/distribution/cedcd327-4e5d-43f9-8eb1-c11850fa7c55/Bike_Lane.csv",
        "mediaType": "text\/csv",
        "format": "csv",
        "title": "Florida Bike Lanes",
        "describedBy": "dkan://metastore/schemas/data-dictionary/items/b063524c-cf33-5d06-a161-770fc688d53b",
        "describedByType": "application/vnd.tableschema+json"
    }
    ],
    "identifier": "cedcd327-4e5d-43f9-8eb1-c11850fa7c55",
    "issued": "2016-06-22",
    "license": "http://opendatacommons.org/licenses/by/1.0/",
    "modified": "2016-06-22",
    "publisher": {
        "@type": "org:Organization",
        "name": "State Economic Council"
    },
    "spatial": "Florida",
    "theme": [
        "Transportation",
        "City Planning"
    ],
    "title": "Florida Bike Lanes ",
    "keyword":["bike-lanes", "streets", "infrastructure"]
}
  • Run the import and post_import queues. The dictionary should be applied.

Resource does not have dictionary

QA:

  • Set the DKAN site to use 'Distribution reference' data dictionaries.
  • Import datasets with no associated data dictionaries.
  • Verify that no errors are thrown during the post_import phase, but instead an informative message is displayed.

Helpful script:

ddev dkan-site-install
ddev drush pm-enable sample_content
ddev drush dkan:sample-content:create
ddev drush config:set metastore.settings data_dictionary_mode reference
ddev drush cron
ddev drush cron

You should see messages like this:

 [notice] No data-dictionary found for resource with id "0310f6c2d28bbee5de1211e5b5bb0b1c" and version "1721843317".

You should then also see a message like this in the dataset import status dashboard:

done Resource 0310f6c2d28bbee5de1211e5b5bb0b1c does not have a data dictionary.

@dafeder
Copy link
Member

dafeder commented Dec 24, 2024

Going through the QA, I hit an issue at Unsuccessful Post Import (Drop Resource) > Confirm that the dataset cedcd327-4e5d-43f9-8eb1-c11850fa7c55 has a Store value of waiting(0%) with an error in the Post Import column.

When I ran cron I got the following error:

$ ddev drush cron
 [notice] Localization of resource 3a187a87dc6cd47c48b6b4c4785224b7: 
 [notice] ImportService for 3a187a87dc6cd47c48b6b4c4785224b7__1735074165 completed.
 [error]  SQLSTATE[22007]: Invalid datetime format: 1292 Incorrect date value: '2049' for column 'objectid' at row 1 
 [error]  Failed to drop datastore for aba79c7a62fec6c2e49bc1cebd0d7233. No datastore storage found for 3a187a87dc6cd47c48b6b4c4785224b7:1735068191. 

On the dashboard, I do see the right error in the postImport column but the store says done:

image

I can confirm the datastore was not dropped and the objectId has reverted back to text (it was int after the previous QA steps):

GET https://kaise.ddev.site/api/1/datastore/query/cedcd327-4e5d-43f9-8eb1-c11850fa7c55/0?results=false&count=false

returns

{
  "schema": {
    "699890c9-480c-5f2e-89c5-06e9723303fb": {
      "fields": {
        "objectid": {
          "type": "text",
          "mysql_type": "text",
          "description": "OBJECTID"
        },
        "roadway": {
          "type": "text",
          "mysql_type": "text",
          "description": "ROADWAY"
        },
        "road_side": {
          "type": "text",
          "mysql_type": "text",
          "description": "ROAD_SIDE"
        },
        "lncd": {
          "type": "text",
          "mysql_type": "text",
          "description": "LNCD"
        },
        "descr": {
          "type": "text",
          "mysql_type": "text",
          "description": "DESCR"
        },
        "begin_post": {
          "type": "text",
          "mysql_type": "text",
          "description": "BEGIN_POST"
        },
        "end_post": {
          "type": "text",
          "mysql_type": "text",
          "description": "END_POST"
        },
        "shape_leng": {
          "type": "text",
          "mysql_type": "text",
          "description": "Shape_Leng"
        }
      }
    }
  },
  "query": {
    "results": false,
    "count": false,
    "resources": [
      {
        "id": "699890c9-480c-5f2e-89c5-06e9723303fb",
        "alias": "t"
      }
    ],
    "limit": 500,
    "offset": 0,
    "schema": true,
    "keys": true,
    "format": "json",
    "rowIds": false,
    "properties": [
      "objectid",
      "roadway",
      "road_side",
      "lncd",
      "descr",
      "begin_post",
      "end_post",
      "shape_leng"
    ]
  }
}

Copy link
Member

@dafeder dafeder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! See the question on QA in the PR conversation.

Also, reviewing how the code works there is something off to me in the back and forth between PostImport and PostImportResult. It's just strange to me that PostImport puts this data into an object, which has a storeResult() method, but the storeResult() method just calls another method that's back in PostImport(). It's essentially a circular dependency, though Drupal won't notice because PostImportResult is not a service. I think the correct way to do this would be one of two things:

  1. Make the entire storeResult() method on PostImportResult be essentially self-contained, so it contains the DB query and just needs to be passed a Connection object (ie the database service). Remove PostImport::storeJobStatus(). This would also remove Connection as a dependency for PostImport.
  2. Remove the storeResult() method from PostImportResult, and simply have PostImport::storeJobStatus take a PostImportResult object as its argument, rather than all those individual arguments. This means that rather than rather than calling $result->storeResult(), PostImportResourceProcessor::processItem() would call $this->postImport->storeJobStatus($result).

I'd like to see this changed but if it feels out of scope we can make a new ticket for it, tell me what you think.

@kaise-lafrai
Copy link
Contributor Author

kaise-lafrai commented Dec 31, 2024

@dafeder Can you confirm you selected Drop the datastore table if the post import queue reports an error. on the Resource settings page (admin/dkan/resources) before attempting to run a failing post import?

I attempted the QA steps and was able to see "waiting (0%)" under the store column.

@kaise-lafrai
Copy link
Contributor Author

kaise-lafrai commented Jan 2, 2025

@dafeder I'm attempting to refactor so that I am satisfying this approach.

Make the entire storeResult() method on PostImportResult be essentially self-contained, so it contains the DB query and just needs to be passed a Connection object (ie the database service). Remove PostImport::storeJobStatus(). This would also remove Connection as a dependency for PostImport.

I would not be able to remove connection as a dependecy for PostImport because I would need to pass that in when instantiating PostImportResult. I will also need to add the connection service in the DashboardForm since I moved over all the db related code out of the PostImport class into the PostImportResult class.

@dafeder
Copy link
Member

dafeder commented Jan 3, 2025

@dafeder Can you confirm you selected Drop the datastore table if the post import queue reports an error. on the Resource settings page (admin/dkan/resources) before attempting to run a failing post import?

I attempted the QA steps and was able to see "waiting (0%)" under the store column.

I think so but possible I missed a step. I'll try again on Monday.

@dafeder
Copy link
Member

dafeder commented Jan 3, 2025

@dafeder I'm attempting to refactor so that I am satisfying this approach.

Make the entire storeResult() method on PostImportResult be essentially self-contained, so it contains the DB query and just needs to be passed a Connection object (ie the database service). Remove PostImport::storeJobStatus(). This would also remove Connection as a dependency for PostImport.

I would not be able to remove connection as a dependecy for PostImport because I would need to pass that in when instantiating PostImportResult. I will also need to add the connection service in the DashboardForm since I moved over all the db related code out of the PostImport class into the PostImportResult class.

You're right, apologies for not thinking about it more carefully. I had in my head there was a way to get the connection directly from DatastoreService but it looks like there is not. Disregard.

@kaise-lafrai
Copy link
Contributor Author

@dafeder I'm attempting to refactor so that I am satisfying this approach.

Make the entire storeResult() method on PostImportResult be essentially self-contained, so it contains the DB query and just needs to be passed a Connection object (ie the database service). Remove PostImport::storeJobStatus(). This would also remove Connection as a dependency for PostImport.

I would not be able to remove connection as a dependecy for PostImport because I would need to pass that in when instantiating PostImportResult. I will also need to add the connection service in the DashboardForm since I moved over all the db related code out of the PostImport class into the PostImportResult class.

I was able to use the database object like this $connection = \Drupal::database(); in the PostImportResult class. This way I can avoid injecting it into in the PostImport and DashboardForm classes. Pros and Cons doing it this way but we do satisfy code climate by reducing the number of arguments in the constructor.

@dafeder
Copy link
Member

dafeder commented Jan 3, 2025

Let's stick with the dependency injection pattern, the argument limit from CC is a good guideine but not an absolute rule.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants